Triton SM90: AxisInfoAnalysis Dimension Collapse and Low-Level C++ Out-of-Bounds Diagnosis

@Qubitium @colesbury I've looked into this segmentation fault. The crash in create_module (under importlib._bootstrap_external, the step that loads an extension module) points to a thread-safety violation caused by the mismatch between Triton's C-extension architecture, which assumes the GIL, and Python 3.13t's free-threaded (no-GIL) runtime.

1. The Root Cause: Race Condition in Module Initialization

In standard (GIL-enabled) Python, the GIL serializes the execution of C-extension code, including module initialization. Under 3.13t with the GIL disabled, that protection is gone:

The Trigger: Your code applies @triton.autotune and @triton.jit at module import time. This invokes compile_module_from_src, which attempts to dynamically build and load a C++ wrapper module for the generated CUDA kernel.
The Failure: Triton's backend (specifically the driver and JIT compiler) relies heavily on global static state to manage the CUDA driver context and the LLVM PassManager.
The Segfault: Without GIL serialization, concurrent imports or background autotuning threads can race, so that create_module registers symbols against a partially initialized or overwritten pointer. The result is a memory access violation.
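The failure mode above can be illustrated with a minimal pure-Python sketch. This is not Triton's code: DriverState, get_state and _state_lock are hypothetical names standing in for process-wide driver state that is created lazily on first use. The lock plays the role the GIL used to play implicitly, so that only one thread ever runs the initialization.

```python
import threading


class DriverState:
    """Hypothetical stand-in for process-wide driver state (not Triton's API)."""

    instances = 0  # counts how many times initialization actually ran

    def __init__(self):
        DriverState.instances += 1
        self.context = object()  # placeholder for a driver context handle


_state = None
_state_lock = threading.Lock()


def get_state():
    """Return the shared state, creating it at most once across all threads."""
    global _state
    if _state is None:              # fast path once initialization is done
        with _state_lock:           # explicit serialization replaces the GIL
            if _state is None:      # re-check: another thread may have won
                _state = DriverState()
    return _state


if __name__ == "__main__":
    seen = []
    workers = [threading.Thread(target=lambda: seen.append(get_state()))
               for _ in range(8)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("threads:", len(seen), "initializations:", DriverState.instances)
```

Without the lock, two threads could both observe `_state is None` and each build their own state, which is the Python-level analogue of the partially initialized pointer described above.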

2. Lack of Multi-phase Initialization (PEP 489) Support

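Triton's C extension has likely not yet adopted multi-phase initialization and is not marked as supporting free-threading (the Py_mod_gil slot set to Py_MOD_GIL_NOT_USED). By default, the free-threaded interpreter re-enables the GIL, with a RuntimeWarning, when it imports an extension that lacks this declaration; that safeguard does not apply when the GIL has been forced off with PYTHON_GIL=0 or -X gil=0. With the GIL off, Triton's un-isolated C++ state and its CUDA Driver API calls may conflict with the free-threaded interpreter, including its mimalloc-based allocator.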
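To see which situation a given process is in, the interpreter can be asked directly. The sketch below uses sysconfig's Py_GIL_DISABLED build variable and sys._is_gil_enabled(), which exists from Python 3.13 onward; the helper name gil_report is my own, and on older interpreters the function simply assumes the GIL is on.

```python
import sys
import sysconfig


def gil_report():
    """Report whether this is a free-threaded build and whether the GIL is on."""
    # Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # sys._is_gil_enabled() was added in 3.13; fall back to True before that.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = True if probe is None else bool(probe())

    return {
        "free_threaded_build": free_threaded_build,
        "gil_enabled": gil_enabled,
    }


if __name__ == "__main__":
    print(gil_report())
```

Calling gil_report() once before and once after `import triton` would show whether the import caused the GIL to be re-enabled in that process.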

Temporary Workaround

Until Triton's C extension is refactored to isolate its state (removing the global static state reached through driver.py), force the interpreter to keep the GIL enabled:

PYTHON_GIL=1 CUDA_VISIBLE_DEVICES=7 pytest test_mimo.py

If the code runs without error under PYTHON_GIL=1, that strongly suggests the segfault comes from unsynchronized concurrent access within Triton's C-API layer.
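The check can also be scripted. The following is a rough sketch for POSIX systems, where a child process killed by SIGSEGV is reported by subprocess with a negative return code. The helper names (env_with_gil, classify, run_with_gil) are hypothetical, and the command is only an example taken from the workaround above.

```python
import os
import signal
import subprocess


def env_with_gil(base=None):
    """Copy an environment and force the GIL on via PYTHON_GIL=1."""
    env = dict(os.environ if base is None else base)
    env["PYTHON_GIL"] = "1"
    return env


def classify(returncode):
    """Map a subprocess return code to a coarse outcome label."""
    if returncode == 0:
        return "ok"
    # On POSIX, death by signal N is reported as -N.
    if returncode == -signal.SIGSEGV:
        return "segfault"
    return "failed"


def run_with_gil(cmd):
    """Run cmd (a list of arguments) with the GIL forced on and classify it."""
    proc = subprocess.run(cmd, env=env_with_gil())
    return classify(proc.returncode)


if __name__ == "__main__":
    # Example command from the workaround; adjust to the local setup.
    print(run_with_gil(["pytest", "test_mimo.py"]))
```

An "ok" result here, combined with a "segfault" from the same command run without PYTHON_GIL=1, is the comparison the paragraph above describes.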

Verdict: This is an upstream compatibility gap in Triton’s backend rather than a bug in GPTQModel.